Abstract

Word embeddings - distributed representations of words - in deep
learning are beneficial for many tasks in natural language processing
(NLP). However, different embedding sets vary greatly in quality and
characteristics of the captured semantics. Instead of relying on a more
advanced algorithm for embedding learning, this paper proposes an
ensemble approach of combining different public embedding sets with the
aim of learning meta-embeddings. Experiments on word similarity and
analogy tasks and on part-of-speech tagging show better performance of
meta-embeddings compared to individual embedding sets. One advantage of
meta-embeddings is the increased vocabulary coverage. We will release
our meta-embeddings publicly.

1 Introduction 

Recently, deep neural network (NN) models have achieved remarkable
results in NLP (Collobert and Weston, 2008; Sutskever et al., 2014;
Rocktäschel et al., 2015). One reason for these results is word
embeddings, compact distributed word representations learned in an
unsupervised manner from large corpora (Bengio et al., 2003; Mnih and
Hinton, 2009; Mikolov et al., 2010; Mikolov, 2012; Mikolov et al.,
2013a).

Some prior work has studied differences in performance of different
embedding sets. For example, Chen et al. (2013) showed that the
embedding sets HLBL (Mnih and Hinton, 2009), SENNA (Collobert and
Weston, 2008), Turian (Turian et al., 2010) and Huang (Huang et al.,
2012) have great variance in quality and characteristics of the
semantics captured. Hill et al. (2014; 2015a) showed that embeddings
learned by NN machine translation models can outperform three
representative monolingual embedding sets: word2vec (Mikolov et al.,
2013b), GloVe (Pennington et al., 2014) and CW (Collobert and Weston,
2008). Bansal et al. (2014) found that Brown clustering, SENNA, CW,
Huang and word2vec yield significant gains for dependency parsing.
Moreover, using these representations together achieved the best
results, suggesting their complementarity. These prior studies motivate
us to explore an ensemble approach. Since each embedding set is trained
by a different NN on a different corpus and can be treated as a distinct
description of words, our expectation is that the ensemble contains more
information than each component embedding set. We want to leverage this
diversity to learn better-performing word embeddings.

The ensemble approach has two benefits. First, enhancement of the
representations: meta-embeddings perform better than the individual
embedding sets. Second, coverage: meta-embeddings cover more words than
the individual embedding sets. The first three ensemble methods we
introduce are CONC, SVD and 1toN; taken by themselves, they only have
the benefit of enhancement. They learn meta-embeddings on the
overlapping vocabulary of the embedding sets. CONC concatenates the
vectors of a word from the different embedding sets. SVD performs
dimension reduction on this concatenation. 1toN assumes that a
meta-embedding for the word exists and uses this meta-embedding to
predict representations of the word in the individual embedding sets -
the resulting fine-tuned meta-embedding is expected to contain knowledge
from all individual embedding sets.

To also address the objective of increased coverage of the vocabulary,
we introduce 1toN+, a modification of 1toN that learns meta-embeddings
for all words in the vocabulary union in one step. Let an
out-of-vocabulary (OOV) word w of embedding set ES be a word that is not
covered by ES (i.e., ES does not contain an embedding for w).^1 1toN+
first randomly initializes the embeddings for OOVs and the
meta-embeddings, then uses a prediction setup similar to 1toN to update
meta-embeddings as well as OOV embeddings. Thus, 1toN+ simultaneously
achieves two goals: learning meta-embeddings and extending the
vocabulary (for both meta-embeddings and individual embedding sets).

An alternative method that increases coverage is MutualLearning.
MutualLearning learns the embedding for a word that is an OOV in one
embedding set from its embeddings in other embedding sets. We will use
MutualLearning to increase coverage for CONC, SVD and 1toN, so that
these three methods (when used together with MutualLearning) have the
advantages of both performance enhancement and increased coverage.

In summary, meta-embeddings have two benefits compared to individual
embedding sets: enhancement of performance and improved coverage of the
vocabulary. Below, we demonstrate this experimentally for three tasks:
word similarity, word analogy and POS tagging.

If we simply view meta-embeddings as a way of coming up with better
embeddings, then the alternative is to develop a single embedding
learning algorithm that produces better embeddings. Some improvements
proposed before have the disadvantage of increasing the training time of
embedding learning substantially; e.g., the NNLM presented in (Bengio et
al., 2003) is an order of magnitude less efficient than an algorithm
like word2vec and, more generally, replacing a linear objective function
with a nonlinear objective function increases training time. Similarly,
fine-tuning the hyperparameters of the embedding learning algorithm is
complex and time consuming. In many cases, it is not possible to retrain
using a different algorithm because the corpus is not publicly
available. But even if these obstacles could be overcome, it is unlikely
that there ever will be a single "best" embedding learning algorithm. So
the current situation of multiple embedding sets with different
properties being available is likely to persist for the foreseeable
future. Meta-embedding learning is a simple and efficient way of taking
advantage of this diversity. As we will show below, meta-embeddings
combine several complementary embedding sets and are stronger than each
individual set.

^1 We do not consider words in this paper that are not covered by any of
the individual embedding sets. OOV always refers to a word that is
covered by at least one embedding set.

2 Related Work 

Related work has focused on improving performance on specific tasks by
using several embedding sets simultaneously. To our knowledge, there is
no work that aims to learn generally useful meta-embeddings from
individual embedding sets.

Tsuboi (2014) incorporated word2vec and GloVe embeddings into a POS
tagging system and found that using these two embedding sets together
was better than using them individually. Similarly, Turian et al. (2010)
found that using Brown clusters, CW embeddings and HLBL embeddings for
NER and chunking tasks together gave better performance than using these
representations individually.

Luo et al. (2014) adapted CBOW (Mikolov et al., 2013a) to train word
embeddings on different datasets - a Wikipedia corpus, search
click-through data and user query data - for web search ranking and for
word similarity. They showed that using these embeddings together gives
stronger results than using them individually.

These papers show that using multiple embedding sets is beneficial.
However, they either use embedding sets trained on the same corpus
(Turian et al., 2010) or enhance embedding sets by more training data,
not by innovative learning algorithms (Luo et al., 2014). In our work,
we can leverage any publicly available embedding set learned by any
learning algorithm. Our meta-embeddings are generically useful and are
learned by supervised training of an explicit model of the dependencies
between embedding sets and (except for CONC) not by simple
concatenation.



3 Experimental Embedding Sets 

In this work, we use five released embedding sets.

(i) HLBL. Hierarchical log-bilinear (Mnih and Hinton, 2009) embeddings
released by Turian et al. (2010);^2 246,122 word embeddings, 100
dimensions; training corpus: RCV1 corpus (Reuters English newswire,
August 1996 - August 1997).
(ii) Huang.^3 Huang et al. (2012) incorporated global context to deal
with challenges raised by words with multiple meanings; 100,232 word
embeddings, 50 dimensions; training corpus: April 2010 snapshot of
Wikipedia.
(iii) GloVe^4 (Pennington et al., 2014). 1,193,514 word embeddings, 300
dimensions; training corpus: 42 billion tokens of web data, from Common
Crawl.
(iv) CW (Collobert and Weston, 2008). Released by Turian et al.
(2010);^5 268,810 word embeddings, 200 dimensions; training corpus: same
as HLBL.
(v) word2vec (Mikolov et al., 2013b) CBOW;^6 929,022 word embeddings (we
discard phrase embeddings), 300 dimensions; training corpus: Google News
(about 100 billion words).

The intersection of the five vocabularies has size 35,965, the union has
size 2,788,636.

4 Ensemble Methods 

This section introduces the four ensemble methods: CONC, SVD, 1toN and
1toN+.

4.1 CONC: Concatenation 

In CONC, the meta-embedding of w is the concatenation of five
embeddings, one each from the five embedding sets. For GloVe, we perform
L2 normalization for each dimension across the vocabulary as recommended
by the GloVe authors. Then each embedding of each embedding set is
L2-normalized. This ensures that each embedding set contributes equally
(a value between -1 and 1) when we compute similarity via dot product.

We would like to make use of prior knowledge and give more weight to
well performing embedding sets. In this work, we give GloVe and word2vec
weight i > 1 and weight 1 to the other three embedding sets. We use MC30
(Miller and Charles, 1991) as dev set, since all embedding sets fully
cover it. We set i = 8, the value in Figure 1 where performance reaches
a plateau. After L2 normalization, GloVe and word2vec embeddings are
multiplied by i and the remaining embedding sets are left unchanged.

Figure 1: Performance vs. Weight scalar i

^2 http://metaoptimize.com/projects/wordreprs/
^3 http://ai.stanford.edu/~ehhuang/
^4 http://nlp.stanford.edu/projects/glove/
^5 http://metaoptimize.com/projects/wordreprs/
^6 http://code.google.com/p/Word2Vec/

The dimensionality of CONC meta-embeddings is
k = 100 + 50 + 300 + 200 + 300 = 950.
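The CONC steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `conc` are hypothetical, random vectors stand in for the real embedding sets, and the per-dimension normalization applied to GloVe is omitted for brevity.

```python
import numpy as np

def l2_normalize(rows):
    """Scale each row vector to unit L2 norm (all-zero rows stay zero)."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    return rows / np.where(norms == 0.0, 1.0, norms)

def conc(embedding_sets, weights):
    """Concatenate the per-set vectors of the same words.

    embedding_sets: list of (n_words, dim_i) arrays whose rows are
                    aligned on the shared (intersection) vocabulary.
    weights:        one scalar per embedding set, applied after
                    L2 normalization.
    """
    parts = [w * l2_normalize(e) for e, w in zip(embedding_sets, weights)]
    return np.concatenate(parts, axis=1)

# Toy stand-ins for the five sets (HLBL, Huang, GloVe, CW, word2vec).
rng = np.random.default_rng(0)
dims = [100, 50, 300, 200, 300]
weights = [1.0, 1.0, 8.0, 1.0, 8.0]   # i = 8 for GloVe and word2vec
toy_sets = [rng.normal(size=(4, d)) for d in dims]

meta = conc(toy_sets, weights)
print(meta.shape)  # (4, 950)
```

Because the weight is applied after normalization, a weighted set contributes a block of norm i instead of 1, so its share of any dot product grows with i squared.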

4.2 SVD: Singular Value Decomposition 

We do SVD on the above weighted concatenation vectors of dimension
k = 950.

Given a set of CONC representations for n words, each of dimensionality
k, we compute an SVD decomposition $C = U S V^T$ of the corresponding
$n \times k$ matrix C. We then use $U_d$, the first d dimensions of U,
as the SVD meta-embeddings of the n words. We apply L2-normalization to
embeddings; similarities of SVD vectors are computed as dot products.
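A minimal sketch of this step, assuming NumPy's `numpy.linalg.svd` as the decomposition routine (the source does not say which implementation was used); `svd_meta_embeddings` is a hypothetical name and the matrix C is a small random stand-in for the n x 950 CONC matrix.

```python
import numpy as np

def svd_meta_embeddings(C, d):
    """Return the first d columns of U from C = U S V^T, L2-normalized per row."""
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    U_d = U[:, :d]
    norms = np.linalg.norm(U_d, axis=1, keepdims=True)
    return U_d / np.where(norms == 0.0, 1.0, norms)

rng = np.random.default_rng(0)
C = rng.normal(size=(20, 12))      # toy stand-in: 20 words, k = 12
E = svd_meta_embeddings(C, d=5)

# After row normalization, a dot product between two rows is their cosine.
similarity = float(E[0] @ E[1])
print(E.shape)  # (20, 5)
```

Note that, following the text, the rows of U_d are used directly and are not rescaled by the singular values.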

d denotes the dimensionality of meta-embeddings in SVD, 1toN and 1toN+.
We use d = 200 throughout and investigate the impact of d below.

4.3 1toN

Figure 2 depicts the simple neural network we employ to learn
meta-embeddings in 1toN. White rectangles denote known embeddings. The
target to learn is the meta-embedding (shown as shaded rectangle).
Meta-embeddings are initialized randomly.

Figure 2: 1toN

















Let c be the number of embedding sets under consideration,
$V_1, V_2, \ldots, V_i, \ldots, V_c$ their vocabularies and
$V^{\cap} = \bigcap_{i=1}^{c} V_i$ the intersection, used as training
set. Let $V_*$ denote the meta-embedding space. We define a projection
$f_{*i}$ from space $V_*$ to space $V_i$ ($i = 1, 2, \ldots, c$) as
follows:

$\hat{w}_i = M_{*i} w_*$    (1)

where $M_{*i} \in \mathbb{R}^{d_i \times d}$, $w_* \in \mathbb{R}^{d}$
is the meta-embedding of word w in space $V_*$ and
$\hat{w}_i \in \mathbb{R}^{d_i}$ is the projected (or learned)
representation of word w in space $V_i$. The training objective is
minimizing the sum of (i) squared error:

$E = \sum_i |\hat{w}_i - w_i|^2$    (2)

and (ii) L2 cost (sum of squares) of the projection weights $M_{*i}$.

As for CONC and SVD, we weight GloVe and word2vec by i = 8. For 1toN, we
implement this by applying the factor i to the corresponding loss part
of the squared error.
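The objective described above (weighted squared error plus an L2 penalty on the projection matrices) can be sketched as below. This is an illustration under assumptions: it uses plain full-batch gradient descent rather than AdaGrad with mini-batches, stores words as rows (so the projection is written W @ M), and `train_1ton`, the learning rate and the toy data are all hypothetical.

```python
import numpy as np

def train_1ton(targets, d, weights, lr=0.002, l2=5e-4, epochs=300, seed=0):
    """Learn meta-embeddings W (n x d) and one projection M_i (d x d_i) per set.

    targets: list of (n, d_i) arrays, the known embeddings of the same n words.
    weights: per-set factor applied to that set's squared-error term.
    """
    rng = np.random.default_rng(seed)
    n = targets[0].shape[0]
    W = rng.normal(scale=0.1, size=(n, d))            # random initialization
    Ms = [rng.normal(scale=0.1, size=(d, T.shape[1])) for T in targets]
    losses = []
    for _ in range(epochs):
        grad_W = np.zeros_like(W)
        loss = 0.0
        for k, (T, a) in enumerate(zip(targets, weights)):
            R = W @ Ms[k] - T                         # prediction residual
            loss += a * np.sum(R ** 2) + l2 * np.sum(Ms[k] ** 2)
            grad_W += 2.0 * a * R @ Ms[k].T
            Ms[k] -= lr * (2.0 * a * W.T @ R + 2.0 * l2 * Ms[k])
        W -= lr * grad_W
        losses.append(loss)
    return W, Ms, losses

# Toy data: 30 words, two embedding sets with unit-norm rows.
rng = np.random.default_rng(1)
toy = [rng.normal(size=(30, 6)), rng.normal(size=(30, 4))]
toy = [t / np.linalg.norm(t, axis=1, keepdims=True) for t in toy]
W, Ms, losses = train_1ton(toy, d=5, weights=[1.0, 8.0])
print(losses[0] > losses[-1])
```

The per-set weight enters only through the loss, which mirrors the statement that the factor i is applied to the corresponding part of the squared error.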

The principle of 1toN is that we treat each individual embedding as a
projection of the meta-embedding, similar to principal component
analysis. An embedding is a description of the word based on the corpus
and the model that were used to create it. The meta-embedding tries to
recover a more comprehensive description of the word when it is trained
to predict the individual descriptions.

1toN can also be understood as a sentence modeling process, similar to
DBOW (Le and Mikolov, 2014). The embedding of each word in a sentence s
is a partial description of s. DBOW combines all partial descriptions to
form a comprehensive description of s. DBOW initializes the sentence
representation randomly, then uses this representation to predict the
representations of individual words. The sentence representation of s
corresponds to the meta-embedding in 1toN; and the representations of
the words in s correspond to the five embeddings for a word in 1toN.

4.4 1toN+ 

Recall that an OOV (with respect to embedding set ES) is defined as a
word unknown in ES. 1toN+ is an extension of 1toN that learns embeddings
for OOVs; thus, it does not have the limitation that it can only be run
on overlapping vocabulary.


Figure 3: 1toN+

Figure 3 depicts 1toN+. In contrast to Figure 2, we assume that the
current word is an OOV in embedding sets 3 and 5. Hence, in the new
learning task, embeddings 1, 2, 4 are known, and embeddings 3 and 5 and
the meta-embedding are targets to learn.

We initialize all OOV representations and meta-embeddings randomly and
use the same mapping formula as for 1toN to connect a meta-embedding
with the individual embeddings. Both meta-embedding and initialized OOV
embeddings are updated during training.
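One way to picture this setup is to treat the missing (OOV) rows of each embedding set as extra trainable parameters. The sketch below is an assumption-laden illustration, not the authors' code: it uses plain gradient descent instead of AdaGrad, drops the per-set weights for brevity, and `train_1ton_plus` and the boolean `known` masks are hypothetical names.

```python
import numpy as np

def train_1ton_plus(targets, known, d, lr=0.002, l2=5e-4, epochs=300, seed=0):
    """Jointly learn meta-embeddings, projections and OOV embeddings.

    targets: list of (n, d_i) arrays over the vocabulary union; rows of
             OOV words may hold anything, they are re-initialized here.
    known:   list of boolean (n,) masks, True where the set covers the word.
    """
    rng = np.random.default_rng(seed)
    n = targets[0].shape[0]
    W = rng.normal(scale=0.1, size=(n, d))
    Ms = [rng.normal(scale=0.1, size=(d, T.shape[1])) for T in targets]
    # OOV rows become free parameters with random starting values.
    Ts = [np.where(k[:, None], T, rng.normal(scale=0.1, size=T.shape))
          for T, k in zip(targets, known)]
    for _ in range(epochs):
        grad_W = np.zeros_like(W)
        for i, k in enumerate(known):
            R = W @ Ms[i] - Ts[i]                  # residual for set i
            grad_W += 2.0 * R @ Ms[i].T
            Ms[i] -= lr * (2.0 * W.T @ R + 2.0 * l2 * Ms[i])
            # d/dT of |WM - T|^2 is -2R, so descent moves OOV rows toward WM.
            Ts[i][~k] += lr * 2.0 * R[~k]
        W -= lr * grad_W
    return W, Ms, Ts

rng = np.random.default_rng(1)
targets = [rng.normal(size=(20, 4)), rng.normal(size=(20, 3))]
known = [np.ones(20, dtype=bool), np.ones(20, dtype=bool)]
known[0][:5] = False     # words 0-4 are OOV in set 0
known[1][15:] = False    # words 15-19 are OOV in set 1
W, Ms, Ts = train_1ton_plus(targets, known, d=5)
print(W.shape, Ts[0].shape, Ts[1].shape)
```

Known rows are never touched by the update, so the original embeddings are preserved while OOV rows and meta-embeddings move together.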

Each embedding set contains information about only a part of the overall
vocabulary. However, it can predict what the remaining part should look
like by comparing words it knows with the information other embedding
sets provide about these words. Thus, 1toN+ learns a model of the
dependencies between the individual embedding sets and can use these
dependencies to infer what the embedding of an OOV should look like.

CONC, SVD and 1toN compute meta-embeddings only for the intersection
vocabulary. 1toN+ computes meta-embeddings for the union of all
individual vocabularies, thus greatly increasing the coverage of
individual embedding sets.

5 MutualLearning 

MutualLearning is a method that extends CONC, SVD and 1toN such that
they have increased coverage of the vocabulary. With MutualLearning, all
four ensemble methods - CONC, SVD, 1toN and 1toN+ - have the benefits of
both performance enhancement and increased coverage and we can use
criteria like performance, compactness and efficiency of training to
select the best ensemble method for a particular application.










                       bs      lr       L2 weight
1toN                   200     0.005    5 x 10^-4
MutualLearning (ml)    200     0.01     5 x 10^-8
1toN+                  2000    0.005    5 x 10^-4

Table 1: Training setup. bs: batch size; lr: learning rate.

MutualLearning is applied to learn OOV embeddings for all c embedding
sets; however, for ease of exposition, let us assume we want to compute
embeddings for OOVs for embedding set j only, based on known embeddings
in the other c - 1 embedding sets, with indexes
$i \in \{1, \ldots, j-1, j+1, \ldots, c\}$. We do this by learning c - 1
mappings $f_{ij}$, each a projection from embedding set $E_i$ to
embedding set $E_j$.

Similar to Section 4.3, we train mapping $f_{ij}$ on the intersection
$V_i \cap V_j$ of the vocabularies covered by the two embedding sets.
Formally, $\hat{w}_j = f_{ij}(w_i) = M_{ij} w_i$ where
$M_{ij} \in \mathbb{R}^{d_j \times d_i}$, $w_i \in \mathbb{R}^{d_i}$
denotes the representation of word w in space $V_i$ and $\hat{w}_j$ is
the projected meta-embedding of word w in space $V_j$. Training loss has
the same form as for 1toN. A total of c - 1 projections $f_{ij}$ are
trained to learn OOV embeddings for embedding set j.

Let w be a word unknown in the vocabulary $V_j$ of embedding set j, but
known in $V_1, V_2, \ldots, V_k$. To compute an embedding for w in
$V_j$, we first compute the k projections $f_{1j}(w_1)$, $f_{2j}(w_2)$,
..., $f_{kj}(w_k)$ from the source spaces $V_1, V_2, \ldots, V_k$ to the
target space $V_j$. Then, the element-wise average of $f_{1j}(w_1)$,
$f_{2j}(w_2)$, ..., $f_{kj}(w_k)$ is treated as the representation of w
in $V_j$. Our motivation is that - assuming there is a true
representation of w in $V_j$ and assuming the projections were learned
well - we would expect all the projected vectors to be close to the true
representation. Also, each source space contributes potentially
complementary information. Hence averaging them is a balance of
knowledge from all source spaces.
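The project-then-average procedure can be sketched as follows. Note the swapped-in technique: instead of training each projection with AdaGrad as in the paper, this sketch fits it in closed form by ridge regression, which minimizes the same kind of objective (squared error plus L2 penalty). The names `fit_projection` and `infer_oov` and the toy data are hypothetical, and words are stored as rows.

```python
import numpy as np

def fit_projection(X, Y, l2=1e-3):
    """Ridge solution M minimizing |X M - Y|^2 + l2 |M|^2.

    X: (n, d_i) source vectors, Y: (n, d_j) target vectors of the words
    shared by the two sets.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ Y)

def infer_oov(pairs, oov_vectors, l2=1e-3):
    """Average the projections of one OOV word from every source space.

    pairs:       list of (X_i, Y_i) training pairs, one per source set.
    oov_vectors: the word's vector in each source set, in the same order.
    """
    preds = [v @ fit_projection(X, Y, l2) for (X, Y), v in zip(pairs, oov_vectors)]
    return np.mean(preds, axis=0)

# Toy check: both source spaces are linear images of the target space,
# so the averaged projection should recover the held-out target vector.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 3))                       # target-space vectors
B1 = np.eye(3) + 0.1 * rng.normal(size=(3, 3))
B2 = np.eye(3) + 0.1 * rng.normal(size=(3, 3))
z_new = rng.normal(size=3)                         # "true" vector of the OOV
pairs = [(Z @ B1, Z), (Z @ B2, Z)]
estimate = infer_oov(pairs, [z_new @ B1, z_new @ B2])
print(np.round(estimate - z_new, 3))
```

With real embedding sets the relation between spaces is only approximately linear, so the individual projections disagree and averaging acts as a compromise between them.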

6 Experiments 

We train NNs by back-propagation with AdaGrad (Duchi et al., 2011) and
mini-batches. Table 1 gives hyperparameters.

We report results on three tasks: word similarity, 
word analogy and POS tagging. 


6.1 Word Similarity and Analogy Tasks 

We evaluate on SimLex-999 (Hill et al., 2015b), Wordsim353 (Finkelstein
et al., 2001), RG (Rubenstein and Goodenough, 1965) and RW (Luong et
al., 2013). For completeness, we also show results for MC30, the
validation set.

The word analogy task proposed in (Mikolov et al., 2013b) consists of
questions like, "a is to b as c is to _?". The dataset contains 19,544
such questions, divided into a semantic subset of size 8869 and a
syntactic subset of size 10,675.
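For readers unfamiliar with the task, an analogy question is commonly answered with the vector-offset method: find the word closest to b - a + c. The sketch below assumes cosine similarity and excludes the three question words from the candidates; both are common conventions rather than details stated in this paper, and the tiny vocabulary is made up.

```python
import numpy as np

def answer_analogy(emb, a, b, c):
    """Return the word most cosine-similar to b - a + c, excluding a, b and c."""
    words = list(emb)
    M = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in words])
    q = M[words.index(b)] - M[words.index(a)] + M[words.index(c)]
    scores = M @ (q / np.linalg.norm(q))
    for w in (a, b, c):
        scores[words.index(w)] = -np.inf   # question words cannot be answers
    return words[int(np.argmax(scores))]

# Hand-made 3-dimensional toy vectors (hypothetical, for illustration only).
toy = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, 0.2, -1.0]),
}
print(answer_analogy(toy, "man", "woman", "king"))  # queen
```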

Table 2 gives results on similarity and analogy. Numbers in parentheses
are line numbers in what follows. Block "ind-full" (1-5) lists the
performance of individual embedding sets on the full vocabulary. Results
on lines 6-34 are for the intersection of the vocabularies of the five
embedding sets: "ind-overlap" contains the performance of individual
embedding sets, "ensemble" the performance of our four ensemble methods
and "discard" the performance when one component set is removed.

The four ensemble approaches are very promising (31-34). For CONC,
discarding HLBL, Huang or CW does not hurt performance: CONC (31),
CONC(-HLBL) (11), CONC(-Huang) (12) and CONC(-CW) (14) beat each
individual embedding set (6-10) in all tasks. GloVe contributes most in
SimLex-999, WS353, MC30 and RG; word2vec contributes most in RW and word
analogy tasks.

SVD (32) reduces the dimensionality of CONC from 950 to 200, but still
gains performance in SimLex-999 and RG. GloVe contributes most in SVD
(larger losses on line 18 vs. lines 16-17, 19-20). Other embeddings
contribute inconsistently.

1toN performs well only on word analogy, but it gains great improvement
when discarding CW embeddings (24). 1toN+ performs better than 1toN: it
has stronger results when considering all embedding sets, and can still
outperform individual embedding sets while discarding HLBL (26), Huang
(27) or CW (29).

These results demonstrate that ensemble methods using multiple embedding
sets produce stronger embeddings. However, it does not mean the more
embedding sets the better. Whether an embedding set helps depends on the
complementarity among the sets as well as on how we measure the ensemble
results.




 #  Model           SL999     WS353      MC30      RG        RW           sem.         syn.         tot.

ind-full
 1  HLBL            22.1 (1)  35.7 (3)   41.5 (0)  35.2 (1)  19.1 (892)   27.1 (423)   22.8 (198)   24.7
 2  Huang           9.7 (3)   61.7 (18)  65.9 (0)  63.0 (0)  6.4 (982)    8.4 (1016)   11.9 (326)   10.4
 3  GloVe           45.3 (0)  75.4 (18)  83.6 (0)  82.9 (0)  48.7 (21)    81.4 (0)     70.1 (0)     75.2
 4  CW              15.6 (1)  28.4 (3)   21.7 (0)  29.9 (1)  15.3 (896)   17.4 (423)   5.0 (198)    10.5
 5  W2V             44.2 (0)  69.8 (0)   78.9 (0)  76.1 (0)  53.4 (209)   77.1 (0)     74.4 (0)     75.6

ind-overlap
 6  HLBL            22.3 (3)  34.8 (21)  41.5 (0)  35.2 (1)  22.2 (1212)  13.8 (8486)  15.4 (1859)  15.4
 7  Huang           9.7 (3)   62.0 (21)  65.9 (0)  64.1 (1)  3.9 (1212)   27.9 (8486)  9.9 (1859)   10.7
 8  GloVe           45.0 (3)  75.5 (21)  83.6 (0)  82.4 (1)  59.1 (1212)  91.1 (8486)  68.2 (1859)  69.2
 9  CW              16.0 (3)  30.8 (21)  21.7 (0)  29.9 (1)  17.4 (1212)  11.2 (8486)  2.3 (1859)   2.7
10  W2V             44.1 (3)  69.3 (21)  78.9 (0)  75.4 (1)  61.5 (1212)  89.3 (8486)  72.6 (1859)  73.3

discard
11  CONC (-HLBL)    46.0 (3)  76.5 (21)  86.3 (0)  82.5 (1)  63.0 (1211)  93.2 (8486)  74.0 (1859)  74.8
12  CONC (-Huang)   46.1 (3)  76.5 (21)  86.3 (0)  82.5 (1)  62.9 (1212)  93.2 (8486)  74.0 (1859)  74.8
13  CONC (-GloVe)   44.0 (3)  69.4 (21)  79.1 (0)  75.6 (1)  61.5 (1212)  89.3 (8486)  72.7 (1859)  73.4
14  CONC (-CW)      46.0 (3)  76.5 (21)  86.6 (0)  82.5 (1)  62.9 (1212)  93.2 (8486)  73.9 (1859)  74.7
15  CONC (-W2V)     45.0 (3)  75.5 (21)  83.6 (0)  82.4 (1)  59.1 (1212)  90.9 (8486)  68.3 (1859)  69.2
16  SVD (-HLBL)     48.5 (3)  76.1 (21)  85.6 (0)  82.7 (1)  61.5 (1211)  90.6 (8486)  69.5 (1859)  70.4
17  SVD (-Huang)    48.8 (3)  76.5 (21)  85.4 (0)  83.5 (1)  61.7 (1212)  91.4 (8486)  69.8 (1859)  70.7
18  SVD (-GloVe)    46.2 (3)  66.9 (21)  81.6 (0)  78.6 (1)  59.1 (1212)  88.8 (8486)  67.3 (1859)  68.2
19  SVD (-CW)       48.5 (3)  76.1 (21)  85.7 (0)  82.7 (1)  61.5 (1212)  90.6 (8486)  69.5 (1859)  70.4
20  SVD (-W2V)      49.4 (3)  79.0 (21)  87.3 (0)  80.7 (1)  59.1 (1212)  90.3 (8486)  66.0 (1859)  67.1
21  1toN (-HLBL)    46.3 (3)  75.5 (21)  82.4 (0)  81.0 (1)  60.1 (1211)  91.9 (8486)  75.9 (1859)  76.5
22  1toN (-Huang)   46.5 (3)  75.4 (21)  82.4 (0)  82.3 (1)  60.2 (1212)  93.5 (8486)  76.3 (1859)  77.0
23  1toN (-GloVe)   43.4 (3)  66.5 (21)  76.5 (0)  75.3 (1)  56.5 (1212)  89.0 (8486)  73.8 (1859)  74.5
24  1toN (-CW)      47.4 (3)  76.5 (21)  84.8 (0)  83.4 (1)  62.0 (1212)  91.4 (8486)  73.1 (1859)  73.8
25  1toN (-W2V)     46.3 (3)  75.9 (21)  80.1 (0)  78.6 (1)  56.8 (1212)  92.2 (8486)  72.2 (1859)  73.0
26  1toN+ (-HLBL)   46.1 (3)  75.8 (21)  85.5 (0)  83.3 (1)  62.3 (1211)  92.2 (8486)  76.2 (1859)  76.9
27  1toN+ (-Huang)  46.2 (3)  76.1 (21)  86.3 (0)  83.3 (1)  62.2 (1212)  93.8 (8486)  76.1 (1859)  76.8
28  1toN+ (-GloVe)  45.3 (3)  71.2 (21)  80.0 (0)  75.7 (1)  62.5 (1212)  90.0 (8486)  73.3 (1859)  74.0
29  1toN+ (-CW)     46.9 (3)  78.1 (21)  85.5 (0)  83.9 (1)  62.7 (1212)  91.8 (8486)  73.3 (1859)  74.1
30  1toN+ (-W2V)    45.8 (3)  76.2 (21)  84.4 (0)  83.1 (1)  60.9 (1212)  92.4 (8486)  72.4 (1859)  73.2

ensemble
31  CONC            46.0 (3)  76.5 (21)  86.3 (0)  82.5 (1)  62.9 (1212)  93.2 (8486)  74.0 (1859)  74.8
32  SVD             48.5 (3)  76.0 (21)  85.7 (0)  82.7 (1)  61.5 (1212)  90.6 (8486)  69.5 (1859)  70.4
33  1toN            46.4 (3)  74.5 (21)  80.7 (0)  80.7 (1)  60.1 (1212)  91.9 (8486)  76.1 (1859)  76.8
34  1toN+           46.3 (3)  75.3 (21)  85.2 (0)  82.7 (1)  61.6 (1212)  92.5 (8486)  76.3 (1859)  77.0

Table 2: Results on five word similarity tasks and analogical reasoning.
The number of OOVs is given in parentheses for each result.
"ind-full/ind-overlap": individual embedding sets with respective
full/overlapping vocabulary; "ensemble": ensemble results using all five
embedding sets; "discard": one of the five embedding sets is removed. If
a result is better than all methods in "ind-overlap", then it is bolded.



           RW(21)                     semantic                   syntactic                  total
           RND   AVG   ml    1toN+    RND   AVG   ml    1toN+    RND   AVG   ml    1toN+    RND   AVG   ml    1toN+

ind
  HLBL     7.4   6.9   17.3  17.5     26.3  26.4  26.3  26.4     22.4  22.4  22.7  22.9     24.1  24.2  24.4  24.5
  Huang    4.4   4.3   6.4   6.4      1.2   2.7   21.8  22.0     7.7   4.1   10.9  11.4     4.8   3.3   15.8  16.2
  CW       7.1   10.6  17.3  17.7     17.2  17.2  16.7  18.4     4.9   5.0   5.0   5.5      10.5  10.5  10.3  11.4

ensemble
  CONC     14.2  16.5  48.3  -        4.6   18.0  88.1  -        62.4  15.1  74.9  -        36.2  16.3  81.0  -
  SVD      12.4  15.7  47.9  -        4.1   17.5  87.3  -        54.3  13.6  70.1  -        31.5  15.4  77.9  -
  1toN     16.7  11.7  48.5  -        4.2   17.6  88.2  -        60.0  15.0  76.8  -        34.7  16.1  82.0  -
  1toN+    -     -     -     48.8     -     -     -     88.4     -     -     -     76.3     -     -     -     81.1

Table 3: Comparison of effectiveness of four methods for learning OOV
embeddings. RND: random initialization. AVG: average of embeddings of
known words. ml: MutualLearning. RW(21) means there are still 21 OOVs
for the vocabulary union.



CONC, the simplest ensemble, has robust performance. However, using
embeddings of size 950 as input may mean too many parameters to tune for
deep learning. The other three methods - SVD, 1toN, 1toN+ - all have the
advantage of smaller dimensionality. SVD reduces CONC's dimensionality
dramatically and still keeps competitive performance, especially on word
similarity. 1toN is competitive on analogy, but weak on word similarity.
1toN+ performs consistently strongly on word similarity and analogy.

Not all state-of-the-art results are included in Table 2. One reason is
that a fair comparison is only possible on the shared vocabulary, so
methods without released embeddings cannot be included. However, GloVe
and word2vec are widely recognized as state-of-the-art embeddings. In
any case, our main contribution is to present ensemble frameworks which
show that a combination of complementary embedding sets produces
better-performing meta-embeddings.

System comparison of learning OOV embeddings. In Table 3, we extend the
vocabularies of each individual embedding set ("ind" block) and our
ensemble approaches ("ensemble" block) to the vocabulary union,
reporting results on RW and analogy - these tasks contain the most OOVs.
As both word2vec and GloVe have full coverage on analogy, we do not
report them again in this table. For each embedding set, we can compute
the representation of an OOV (i) as a randomly initialized vector (RND);
(ii) as the average of embeddings of all known words (AVG); (iii) by
MutualLearning (ml) and (iv) by 1toN+. 1toN+ learns OOV embeddings for
individual embedding sets and meta-embeddings simultaneously, and it
would not make sense to replace these OOV embeddings computed by 1toN+
with embeddings computed by "RND/AVG/ml". Hence, we do not report
"RND/AVG/ml" results for 1toN+.

Table 3 shows four interesting aspects. (i) MutualLearning helps much if
an embedding set has many OOVs in a certain task; e.g., MutualLearning
is much better than AVG and RND on RW, and outperforms RND considerably
for CONC, SVD and 1toN on analogy. However, it cannot make a big
difference for HLBL/CW on analogy, probably because these two embedding
sets have far fewer OOVs, in which case AVG and RND work well enough.
(ii) AVG produces bad results for CONC, SVD and 1toN on analogy,
especially in the syntactic subtask. We notice that those systems have
large numbers of OOVs in the word analogy task. If for analogy "a is to
b as c is to d", all four of a, b, c, d are OOVs, then they are
represented with the same average vector. Hence, similarity between
b - a + c and each OOV is 1.0. In this case, it is almost impossible to
predict the correct answer d. Unfortunately, methods CONC, SVD and 1toN
have many OOVs, resulting in the low numbers in Table 3. (iii)
MutualLearning learns very effective embeddings for OOVs. CONC-ml,
SVD-ml and 1toN-ml all get better results than word2vec and GloVe on
analogy (e.g., for semantic analogy: 88.1, 87.3, 88.2 vs. 81.4 for
GloVe). Considering further their bigger vocabulary, these ensemble
methods are very strong representation learning algorithms. (iv) The
performance of 1toN+ for learning embeddings for OOVs is competitive
with MutualLearning. For HLBL/Huang/CW, 1toN+ performs slightly better
than MutualLearning in all four metrics. Comparing 1toN-ml with 1toN+,
1toN+ is better than "ml" on RW and the semantic task, while performing
worse on the syntactic task.
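The failure mode described in (ii) is easy to reproduce on toy data. In this sketch (random vectors, hypothetical variable names) every OOV word receives the same average vector, so the analogy query b - a + c collapses to that vector and is equally similar to every OOV candidate.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
known = rng.normal(size=(100, 10))   # toy embeddings of in-vocabulary words
avg = known.mean(axis=0)             # AVG: one shared vector for all OOVs

# All four analogy words are OOVs, so they share the same representation.
a = b = c = d = avg
query = b - a + c                    # equals avg
sims = [cosine(query, v) for v in (a, b, c, d)]
print(sims)                          # every similarity is 1.0 up to rounding
```

Since every OOV candidate ties at the maximum similarity, the correct answer d cannot be singled out, which matches the low AVG numbers for CONC, SVD and 1toN in Table 3.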

Figure 4 shows the influence of dimensionality d for SVD, 1toN and
1toN+. Peak performance for different data sets and methods is reached
for $d \in [100, 500]$. There are no big differences in the averages
across data sets and methods for high enough d, roughly in the interval
[150, 500]. In summary, as long as d is chosen to be large enough (e.g.,
> 150), performance is robust.

We will release the meta-embeddings produced by methods SVD, 1toN and
1toN+ for d = 200 and also the meta-embeddings for method CONC.

6.2 Domain Adaptation for POS Tagging 

This section evaluates individual embedding sets and meta-embeddings on
POS tagging. FLORS (Schnabel and Schütze, 2014), the best performing POS
tagger for unsupervised domain adaptation, acts as testbed, in which POS
tagging is a window-based, multilabel classification problem using a
linear SVM. A word's representation consists of four feature vectors
based on suffix, shape, left and right distributional neighbors
respectively. We insert the word's embedding as the fifth feature
vector. All embedding sets (except for 1toN+) are extended to the union
vocabulary by MutualLearning. We follow Schnabel and Schütze (2014) for
all feature learning and also train on sections 2-21 of Wall Street
Journal (WSJ) and evaluate on the development sets of six different
target domains: five SANCL (Petrov and McDonald, 2012) domains -
newsgroups, weblogs, reviews, answers, emails - and sections 22-23 of
WSJ for in-domain testing.

(a) Performance vs. d of SVD (b) Performance vs. d of 1toN (c) Performance vs. d of 1toN+

Figure 4: Influence of dimensionality

                newsgroups     reviews        weblogs        answers        emails         wsj
                ALL    OOV     ALL    OOV     ALL    OOV     ALL    OOV     ALL    OOV     ALL    OOV

baselines
  TnT           88.66  54.73   90.40  56.75   93.33  74.17   88.55  48.32   88.14  58.09   95.76  88.30
  Stanford      89.11  56.02   91.43  58.66   94.15  77.13   88.92  49.30   88.68  58.42   96.83  90.25
  SVMTool       89.14  53.82   91.30  54.20   94.21  76.44   88.96  47.25   88.64  56.37   96.63  87.96
  C&P           89.51  57.23   91.58  59.67   94.41  78.46   89.08  48.46   88.74  58.62   96.78  88.65
  FLORS         90.86  66.42   92.95  75.29   94.71  83.64   90.30  62.15   89.44  62.61   96.59  90.37

+indiv
  FLORS+HLBL    90.01  62.64   92.54  74.19   94.19  79.55   90.25  62.06   89.33  62.32   96.53  91.03
  FLORS+Huang   90.68  68.53   92.86  77.88   94.71  84.66   90.62  65.04   89.62  64.46   96.65  91.69
  FLORS+GloVe   90.99  70.64   92.84  78.19   94.69  86.16   90.54  65.16   89.75  65.61   96.65  92.03
  FLORS+CW      90.37  69.31   92.56  77.65   94.62  84.82   90.23  64.97   89.32  65.75   96.58  91.36
  FLORS+W2V     90.72  72.74   92.50  77.65   94.75  86.69   90.26  64.91   89.19  63.75   96.40  91.03

+meta
  FLORS+CONC    91.87  72.64   92.92  78.34   95.37  86.69   90.69  65.77   89.94  66.90   97.31  92.69
  FLORS+SVD     90.98  70.94   92.47  77.88   94.50  86.49   90.75  64.85   89.88  65.99   96.42  90.36
  FLORS+1toN    91.53  72.84   93.58  78.19   95.65  87.62   91.36  65.36   90.31  66.48   97.66  92.86
  FLORS+1toN+   91.86  73.36   93.14  78.77   95.65  87.29   91.73  66.28   90.53  66.72   97.75  92.55

Table 4: POS tagging results on six target domains. "baselines" lists
representative systems for this task, including FLORS. "+indiv / +meta":
FLORS with individual embedding set / meta-embeddings. Bold means higher
than "baselines" and "+indiv".

Table 4 gives results for some representative systems ("baselines"),
FLORS with individual embedding sets ("+indiv") and FLORS with
meta-embeddings ("+meta"). The following conclusions can be drawn. (i)
Not all individual embedding sets are beneficial in this task; e.g.,
HLBL embeddings make FLORS perform worse in 11 out of 12 cases. (ii)
However, in most cases, embeddings improve system performance, which is
consistent with prior work on using embeddings for this type of task
(Xiao and Guo, 2013; Yang and Eisenstein, 2014; Tsuboi, 2014). (iii)
Meta-embeddings generally help more than the individual embedding sets,
except for SVD (which only performs better in 3 out of 12 cases).

7 Conclusion 

This work presented four ensemble methods - CONC, SVD, 1toN and 1toN+ -
for learning meta-embeddings from multiple embedding sets. Experiments
on word similarity, word analogy and POS tagging indicated the high
quality of these meta-embeddings. The ensemble methods have the added
advantage of increasing vocabulary coverage. We will release the
meta-embeddings.